Disclaimer: The purpose of the Open Case Studies project is to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given data set, and should not be used in the context of making policy decisions without external consultation from scientific experts.
This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) United States License.
Motivation
The following papers motivated this case study.
Twenge JM, Cooper AB, Joiner TE, Duffy ME, Binau SG. Age, period, and cohort trends in mood disorder indicators and suicide-related outcomes in a nationally representative dataset, 2005-2017. J Abnorm Psychol.128,3 (2019):185-199. doi:10.1037/abn0000410
Olfson, M., Blanco, C., Wang, S., Laje, G. & Correll, C. U. National Trends in the Mental Health Care of Children, Adolescents, and Adults by Office-Based Physicians. JAMA Psychiatry. 71, 81 (2014):81-90. doi: 10.1001/jamapsychiatry.2013.3074.
The main findings of the first article are:
Rates of major depressive episode in the last year increased 52% 2005–2017 (from 8.7% to 13.2%) among adolescents aged 12 to 17 and 63% 2009–2017 (from 8.1% to 13.2%) among young adults 18–25.
Serious psychological distress in the last month and suicide-related outcomes (suicidal ideation, plans, attempts, and deaths by suicide) in the last year also increased among young adults 18–25 from 2008–2017 (with a 71% increase in serious psychological distress), with less consistent and weaker increases among adults ages 26 and over.
Cultural trends contributing to an increase in mood disorders and suicidal thoughts and behaviors since the mid-2000s, including the rise of electronic communication and digital media and declines in sleep duration, may have had a larger impact on younger people, creating a cohort effect.
While the main findings of the second article are:
Compared with adult mental health care, the mental health care of young people has increased more rapidly.
Between 1995-1998 and 2007-2010, visits resulting in mental disorder diagnoses per 100 population increased significantly faster for youths (from 7.78 to 15.30 visits) than for adults (from 23.23 to 28.48 visits) (interaction: P < .001).
Psychiatrist visits also increased significantly faster for youths (from 2.86 to 5.71 visits).
While depression appears to be on the rise for youths, youths also appear to be seeking more mental health care.
In this case study we will examine data related to depression episodes and mental health care to evaluate trends over time. We will be using data from the National Survey on Drug Use and Health (NSDUH). This data was also used in the first study.
Main Questions
Our main questions:
- How have depression rates in American youth changed since 2002, according to the NSDUH data?
- Do mental health services appear to be reaching more youths? How have rates differed between different youth subgroups (gender, ethnicity)?
Learning Objectives
It may be a good idea to provide a link to Rstudio’s webpage. For the first few months using R, I did not differentiate between R and R Studio. It may be a good distinction to make at least implicitly by providing a link.
In this case study, we will determine the percentage of youth in America who have had a major depressive episode in the past year, for each year since 2002. We will compare how rates for different youth subgroups have changed over time (by age group (12-13, 14-15, and 16-17), gender, and ethnicity). We will especially focus on using packages and functions from the Tidyverse, such as rvest. The tidyverse is a collection of packages created by RStudio. While some students may be familiar with previous R programming packages, these packages make data science in R especially efficient.

We will begin by loading the packages that we will need:
I made some modifications to the table below. The tidyverse package hyperlink referenced readr. I thought this was incorrect. I changed this to the tidyverse website and provided a different description. If this was indeed a typo, it may need to be fixed in other case studies.
Package | Use
---------- |-------------
here | to easily load and save data
tidyverse | R packages for data science
rvest | to scrape web pages
The first time we use a function, we will use the :: to indicate which package we are using. Unless we have overlapping function names, this is not necessary, but we will include it here to be informative about where the functions we will use come from.
Context
According to other sources, the rate of suicide has increased for most age groups in the United States over the past decade and a half.

While suicide does appear to be increasing among youths, it also appears to be increasing among middle-aged adults, for both females and males.

According to the CDC:
Since 2008, suicide has ranked as the 10th leading cause of death for all ages in the United States. In 2016, suicide became the second leading cause of death among those aged 10–34 and the fourth leading cause among those aged 35–54.
So although suicide is on the rise for most age groups, suicide is one of the top two contributors to death for youths. This warrants further examination of the mental health of American youths.

Limitations
Perhaps “underestimates in the p-values…” is not the correct way to phrase this. I would look for a better way to word this.
Wording for this section should be reviewed.
There are some important considerations regarding this data analysis to keep in mind:
We treat sample estimates (estimates of the true population value) as observed values. This ignores the uncertainty in those estimates, so the p-values of the statistical tests conducted are likely underestimated (too small).
Furthermore, the sampling mechanism utilized can introduce selection bias in cases where the sampling methods do not produce a representative sample.
Data is collected from human participants; this presents the potential for information bias, as participants in the sampling frame may, for a variety of reasons, report inaccurate information.
Data Import
Data is often made available online. Usually, the data we are interested in is made available for download on the page as a delimited text file. However, sometimes data is not made available in this manner.
How do we proceed in this scenario?
We could manually copy each cell of data; however, this process is often inefficient, subject to error, and not reproducible.
We can also use R for web scraping.
Web scraping is the process of extracting data from a website.
There are two main steps to web scraping:
Identify location of data that will be scraped
Save the webpage element to an object
We accomplish STEP 1 with our web browser.
We accomplish STEP 2 in the R programming environment.
I could not find the animation that I referred to on several occasions.
However, I was able to find the sources that I consulted to create the three step rvest process. They are included below
RStudio
Blog
The rvest package can be thought of as the pdftools package for webscraping. Upon pulling the data, additional wrangling will likely be required; but like the pdftools package, rvest streamlines the extraction process.
The two steps can be broken down even further:
- Identify location of data that will be scraped
  - right-click to inspect element (webpage)
  - hover pointer over components of element (webpage) until the data has been found
  - copy XPath of data sought
- Save webpage element to an object
  - import html code for element (webpage)
  - extract pieces (table) out of HTML documents (webpage) using XPath
  - parse the html table into a data frame
Below is an animated overview of the process.
Let’s go to the web page with all the tables we are interested in scraping.

Once on the webpage, there aren’t any visible options to download the data.
Right-click and select “Inspect”

A window opens.
This window allows us to glance at the internal mechanics of the webpage. To scrape the data from the webpage, we first need to learn a little about the components that make up the web page.
Hovering our mouse over the elements of the webpage highlights the respective section of the webpage each one represents. By hovering over several elements (and opening elements when the highlighted portion is too large), we can identify the element that contains the data we are looking for.

Right-click on the element and copy the XPath. We will need this XPath for the next step.

Now we can return to the R programming environment.

I included the following line to help separate the process.
2) Save webpage element to an object
For the first question we intend to answer, the XPath is /html/body/div[4]/div[1]/table. We use this XPath with functions from the rvest package to scrape data from the web.
I wanted to include the last slide/component of the GIF. However, I realized that the audience would also benefit from having an actual code chunk. As a result, this section may need some very minor reworking.

We need to:
- import html code for element (webpage)
- extract pieces (table) out of HTML documents (webpage) using Xpath
- parse the html table into a data frame
To do this:
- We import the HTML code using rvest::read_html().
- We extract specific components of the webpage using rvest::html_node().
- We convert this HTML table into a data frame using rvest::html_table().
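The three steps above can be sketched as follows. This is a minimal illustration, not the case study's own code: the inline HTML string `page_source`, its placeholder values, and the XPath used here are assumptions standing in for the real web page, so that the example runs without a network connection.

```r
library(rvest)

# Hypothetical stand-in for a web page (placeholder values, not NSDUH data)
page_source <- "<html><body>
  <div><p>Some introductory text</p></div>
  <div>
    <table>
      <tr><th>Setting</th><th>2017</th><th>2018</th></tr>
      <tr><td>Placeholder A</td><td>1</td><td>2</td></tr>
      <tr><td>Placeholder B</td><td>3</td><td>4</td></tr>
    </table>
  </div>
</body></html>"

# Step 1: import the HTML (read_html() also accepts a URL or a file path)
page <- rvest::read_html(page_source)

# Step 2: extract the table element using its XPath
table_node <- rvest::html_node(page, xpath = "/html/body/div[2]/table")

# Step 3: parse the HTML table into a data frame
scraped <- rvest::html_table(table_node)

dim(scraped)
```

For the real page, the string would be replaced by the page's URL and the XPath by the one copied from the browser.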
The rvest package provides wrappers for the xml2 and httr packages. I was not sure whether to tag the following functions as rvest or xml2/httr. I will leave that decision to you..
Great! We have successfully scraped the data.
From here on, we will need to wrangle the data.
First, we need to repeat the above process for the other tables we are interested in.
We can create a function to accomplish this succinctly.
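One way such a helper might look is sketched below. The name `scrape_table` and its arguments are hypothetical, chosen for illustration, and the demonstration reads a small temporary file with placeholder values instead of the real web page.

```r
library(rvest)

# Hypothetical helper: read a page and return the table found at `xpath`
scrape_table <- function(location, xpath) {
  page <- rvest::read_html(location)              # import the HTML
  node <- rvest::html_node(page, xpath = xpath)   # locate the table element
  rvest::html_table(node)                         # parse it into a data frame
}

# Demonstration on a local temporary file standing in for the web page
demo_file <- tempfile(fileext = ".html")
writeLines(c("<html><body>",
             "<div><table>",
             "<tr><th>Group</th><th>2018</th></tr>",
             "<tr><td>Placeholder</td><td>1</td></tr>",
             "</table></div>",
             "</body></html>"),
           demo_file)

demo_table <- scrape_table(demo_file, "/html/body/div[1]/table")
```

Because only the XPath changes from table to table, the same helper can be called once per table of interest.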
Note that function is a reserved keyword that R parses directly rather than an ordinary function, so writing it with the base:: prefix causes an error.
We apply the function we created to the remaining tables.
Data Exploration and Wrangling
Now that we’ve imported the data, let’s see if we can wrangle a table. Since the data comes from a source that is well-maintained, it is likely that whatever steps we take to wrangle this first table will also be necessary in the wrangling of subsequent tables. This is because well-maintained data sources often format different datasets similarly. We can take advantage of this similarity to speed up the wrangling process.
Table11.1a
[1] 21 18
table11.1a <- table11.1a[-dim(table11.1a)[1],]
table11.1a <- table11.1a %>%
  dplyr::na_if("nc") %>%
  dplyr::na_if("--") %>%
  dplyr::na_if("") %>%
  dplyr::na_if("*")
table11.1a <- table11.1a %>%
  tibble::as_tibble() %>%
  dplyr::rename(MHS_setting = `Setting Where Mental Health ServiceWas Received`)
partA <- table11.1a %>%
  dplyr::select(MHS_setting)
partB <- table11.1a %>%
  select(-MHS_setting)
partA <- partA %>%
  dplyr::mutate(MHS_setting = base::gsub("[[:digit:]]+|[\r\n]|[[:punct:]]|([[:blank:]])\\1+",
                                         "",
                                         MHS_setting))
partB <- partB %>%
  mutate(dplyr::across(.cols = dplyr::everything(),
                       stringr::str_remove_all, "a")) %>%
  mutate(dplyr::across(.cols = dplyr::everything(),
                       stringr::str_remove_all, ","))
base::rm(table11.1a)
table11.1a <- dplyr::bind_cols(partA,
                               partB)
table11.1a <- table11.1a %>%
  tidyr::pivot_longer(cols = dplyr::contains("20"), names_to = "Year", values_to = "Number")
table11.1a <- table11.1a %>%
  dplyr::filter(MHS_setting != "General Medicine") %>%
  dplyr::filter(MHS_setting != "Juvenile Justice") #Leading lines with no data
table11.1a <- table11.1a %>%
  mutate(across(c(Year, Number), as.numeric))
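To see what these cleaning steps do, they can be tried on a small made-up table. The sketch below is an illustration under stated assumptions: the tibble `toy` and its values are placeholders, not NSDUH data, and the missing-value recoding is done column by column with `replace()` rather than by passing the whole data frame to `na_if()`, which some dplyr versions may not accept.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Made-up table mimicking the scraped layout (placeholder values only)
toy <- tibble::tibble(
  Setting = c("Education1", "General Medicine", "Footnote row"),
  `2017`  = c("1,234a", "*", "nc"),
  `2018`  = c("2,345", "--", "nc")
)

clean <- toy %>%
  utils::head(-1) %>%                                        # drop the last row
  mutate(across(-Setting,
                ~ replace(.x, .x %in% c("nc", "--", "", "*"),
                          NA_character_))) %>%               # recode placeholders as NA
  mutate(across(-Setting, ~ str_remove_all(.x, "[a,]"))) %>% # strip footnote "a" and commas
  mutate(Setting = str_remove_all(Setting, "[[:digit:]]")) %>% # strip footnote digits
  pivot_longer(cols = contains("20"),
               names_to = "Year", values_to = "Number") %>%  # wide to long format
  mutate(across(c(Year, Number), as.numeric))                # convert to numeric

clean
```

The result has one row per setting and year, with `Year` and `Number` stored as numeric columns.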
We will write a function to simplify this process.
The function needs to:
- remove the last row of the table
- get rid of certain patterns
- transition the data to long format
data_prep_settings <- function(TABLE, old_col, new_col, pivot_col){
  TABLE <- TABLE[-dim(TABLE)[1],]
  TABLE <- TABLE %>%
    na_if("nc") %>%
    na_if("--") %>%
    na_if("") %>%
    na_if("*")
  TABLE <- TABLE %>%
    as_tibble() %>%
    rename({{new_col}} := {{old_col}})
  partA <- TABLE %>%
    select({{new_col}})
  partB <- TABLE %>%
    select(-{{new_col}})
  partA <- partA %>%
    mutate({{new_col}} := partA %>%
             select({{new_col}}) %>%
             dplyr::pull({{new_col}}) %>%
             gsub("[[:digit:]]+|[\r\n]|[[:punct:]]|([[:blank:]])\\1+",
                  "", .))
  partB <- partB %>%
    mutate(across(.cols = everything(),
                  str_remove_all, "a")) %>%
    mutate(across(.cols = everything(),
                  str_remove_all, ","))
  rm(TABLE)
  TABLE <- bind_cols(partA,
                     partB)
  TABLE <- TABLE %>%
    pivot_longer(cols = contains("20"), names_to = "Year", values_to = pivot_col)
  TABLE
}
I included the following line to help separate the tables.
Table11.1a
We then apply this function to the table, removing heading rows and ensuring that the relevant columns are of numeric class.
[1] 21 18
We write a function to simplify this process for data that uses demographic groups as units of observation.
The function needs to:
- remove the last row of the table
- get rid of certain patterns
- transition the data to long format
data_prep_dem <- function(TABLE, old_col, new_col, pivot_col){
  TABLE <- TABLE[-dim(TABLE)[1],]
  TABLE <- TABLE %>%
    na_if("nc") %>%
    na_if("--") %>%
    na_if("") %>%
    na_if("*")
  TABLE <- TABLE %>%
    as_tibble() %>%
    rename({{new_col}} := {{old_col}})
  partA <- TABLE %>%
    dplyr::select({{new_col}})
  partB <- TABLE %>%
    dplyr::select(-{{new_col}})
  partA <- partA %>%
    mutate({{new_col}} := partA %>%
             dplyr::select({{new_col}}) %>%
             pull({{new_col}}) %>%
             gsub("[\r\n]|[[:punct:]]|([[:blank:]])\\1+",
                  "", .))
  partA <- partA %>%
    mutate({{new_col}} := dplyr::case_when(stringr::str_detect(!!base::as.name(new_col), pattern = "1") ~ base::paste("Age",
                                                                                                                    stringr::str_sub(!!base::as.name(new_col),
                                                                                                                                     start = 1,
                                                                                                                                     end =2),
                                                                                                                    stringr::str_sub(!!base::as.name(new_col),
                                                                                                                                     start = 3,
                                                                                                                                     end = 4),
                                                                                                                    sep="_"),
                                           TRUE ~ !!base::as.name(new_col)))
  partB <- partB %>%
    mutate(across(.cols = everything(),
                  str_remove_all, "a")) %>%
    mutate(across(.cols = everything(),
                  str_remove_all, ","))
  rm(TABLE)
  TABLE <- bind_cols(partA,
                     partB)
  TABLE <- TABLE %>%
    pivot_longer(cols = contains("20"), names_to = "Year", values_to = pivot_col)
  TABLE
}
I included the following line to help separate the tables.
Table11.2a
We use the produced function to wrangle the next pair of tables.
[1] 18 16
# A tibble: 5 x 2
Demographic n
<chr> <int>
1 AGE GROUP 15
2 AIAN 1
3 GENDER 15
4 HISPANIC ORIGIN AND RACE 15
5 NHOPI 14
I included the following line to help separate the tables.
Table11.2b
[1] 18 16
# A tibble: 5 x 2
Demographic n
<chr> <int>
1 AGE GROUP 15
2 AIAN 1
3 GENDER 15
4 HISPANIC ORIGIN AND RACE 15
5 NHOPI 14
We repeat this process for the remaining tables.
I included the following line to help separate the tables.
Table 11.3a
[1] 18 14
# A tibble: 5 x 2
Demographic n
<chr> <int>
1 AGE GROUP 13
2 AIAN 2
3 GENDER 13
4 HISPANIC ORIGIN AND RACE 13
5 NHOPI 13
I included the following line to help separate the tables.
Table 11.3b
[1] 18 14
# A tibble: 5 x 2
Demographic n
<chr> <int>
1 AGE GROUP 13
2 AIAN 2
3 GENDER 13
4 HISPANIC ORIGIN AND RACE 13
5 NHOPI 13
I included the following line to help separate the tables.
Table 11.4a
[1] 18 16
# A tibble: 7 x 2
Demographic n
<chr> <int>
1 AGE GROUP 15
2 AIAN 15
3 Asian 15
4 GENDER 15
5 HISPANIC ORIGIN AND RACE 15
6 NHOPI 15
7 Two or More Races 12
I included the following line to help separate the tables.
Table 11.4b
[1] 18 16
# A tibble: 7 x 2
Demographic n
<chr> <int>
1 AGE GROUP 15
2 AIAN 15
3 Asian 15
4 GENDER 15
5 HISPANIC ORIGIN AND RACE 15
6 NHOPI 15
7 Two or More Races 12
Now that we’ve wrangled the data, we can go ahead and proceed with our analysis.
Data Analysis
In this section, we only analyzed data from tables 2-4. Data from table 1 is very different from data from tables 2-4. For expediency, I did not include an example with data from table 1. The following code, however, can easily be repurposed to accomplish that once a specific group has been identified to conduct the test on.
We would like to conduct a chi-squared test for independence.
To conduct this statistical test, we need to produce a 2x2 table.
The following code subsets the data we need and makes the necessary manipulations so that the units of observation are appropriate.
The resulting object is still in long format.
# A tibble: 4 x 3
Demographic Year Number
<chr> <dbl> <dbl>
1 Male 2009 577000
2 Male 2018 946000
3 Female 2009 1377000
4 Female 2018 2537000
To conduct a chi-squared test for independence we will need a contingency table.
A contingency table can be produced from data in long format by transforming the data to wide format and repurposing some values as row names.
The final object should look like this.
Year2009 Year2018
Male 577000 946000
Female 1377000 2537000
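A minimal sketch of this long-to-wide transformation is shown below. It rebuilds the long-format tibble from the values printed above; the object names `long_counts` and `contingency` are illustrative assumptions, not necessarily the names used in the case study.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Long-format data, copied from the tibble printed above
long_counts <- tibble::tibble(
  Demographic = c("Male", "Male", "Female", "Female"),
  Year        = c(2009, 2018, 2009, 2018),
  Number      = c(577000, 946000, 1377000, 2537000)
)

contingency <- long_counts %>%
  # one column per year, named Year2009 and Year2018
  tidyr::pivot_wider(names_from = Year,
                     values_from = Number,
                     names_prefix = "Year") %>%
  # repurpose the Demographic values as row names
  tibble::column_to_rownames(var = "Demographic")

contingency
```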
The chi-squared test for independence can be conducted using the stats::chisq.test() function.
Pearson's Chi-squared test with Yates' continuity correction
data: chi_square_11.2a
X-squared = 3482.7, df = 1, p-value < 2.2e-16
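The test above can be reproduced with a short sketch like the following, which enters the contingency table directly as a matrix. The object name `chi_square_11.2a` follows the printed output; building it with `matrix()` is an assumption made here to keep the example self-contained.

```r
# Contingency table from above; matrix() fills column by column
chi_square_11.2a <- matrix(
  c(577000, 1377000,   # Year2009: Male, Female
    946000, 2537000),  # Year2018: Male, Female
  nrow = 2,
  dimnames = list(c("Male", "Female"), c("Year2009", "Year2018"))
)

# For 2x2 tables, chisq.test() applies Yates' continuity correction by default
result <- stats::chisq.test(chi_square_11.2a)
result
```

Keep in mind the limitation noted earlier: the counts are population estimates rather than observed counts, so the very small p-value should be read with caution.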
We can repeat this process for the remaining tables.
Year2009 Year2018
Male 391000 628000
Female 1013000 1795000
Pearson's Chi-squared test with Yates' continuity correction
data: chi_square_11.3a
X-squared = 1696, df = 1, p-value < 2.2e-16
Year2009 Year2018
Male 168000 351000
Female 505000 1081000
Pearson's Chi-squared test with Yates' continuity correction
data: chi_square_11.4a
X-squared = 50.256, df = 1, p-value = 1.349e-12
Data Visualization
This is the intentionally terrible plot that requires faceting.
table11.1b %>%
ggplot2::ggplot(aes(x = Year, y = Percent, group = MHS_setting)) +
ggplot2::geom_line() +
ggplot2::scale_x_continuous(breaks = seq(2009, 2018, by=1),
labels = seq(2009, 2018, by=1),
limits = c(2009, 2018)) +
ggplot2::labs(title = "Settings Where Mental Health Services Were Received in Past Year\namong Persons Aged 12 to 17",
subtitle = "Percentages, 2002-2018")

The plots below need to be correctly faceted. Keep in mind that tables 11.2+ must be faceted by demographic group type and not by setting type.
table11.1b %>%
ggplot(aes(x = Year, y = Percent, group = MHS_setting)) +
geom_line() +
scale_x_continuous(breaks = seq(2009, 2018, by=1),
labels = seq(2009, 2018, by=1),
limits = c(2009, 2018)) +
labs(title = "Settings Where Mental Health Services Were Received in Past Year\namong Persons Aged 12 to 17",
subtitle = "Percentages, 2002-2018")

table11.2b %>%
ggplot(aes(x = Year, y = Percent, group = Demographic)) +
geom_line() +
scale_x_continuous(breaks = seq(2009, 2018, by=1),
labels = seq(2009, 2018, by=1),
limits = c(2009, 2018)) +
labs(title = "Major Depressive Episode in Past Year\namong Persons Aged 12 to 17",
subtitle = "By Demographic Characteristics, Percentages, 2004-2018")

table11.3b %>%
ggplot(aes(x = Year, y = Percent, group = Demographic)) +
geom_line() +
scale_x_continuous(breaks = seq(2009, 2018, by=1),
labels = seq(2009, 2018, by=1),
limits = c(2009, 2018)) +
labs(title = "Major Depressive Episode with Severe Impairment in Past Year\namong Persons Aged 12 to 17",
subtitle = "By Demographic Characteristics: Percentages, 2006-2018")

table11.4b %>%
ggplot(aes(x = Year, y = Percent, group = Demographic)) +
geom_line() +
scale_x_continuous(breaks = seq(2009, 2018, by=1),
labels = seq(2009, 2018, by=1),
limits = c(2009, 2018)) +
labs(title = "Receipt of Treatment for Depression in Past Year among\nPersons Aged 12 to 17 with Major Depressive Episode in Past Year",
subtitle = "By Demographic Characteristics: Percentages, 2004-2018")

The plots created (after faceting properly) can be used to answer the questions listed at the beginning of the case study. After finalizing the plots, some time should be spent framing the visualizations in a way that underscores how they were used to answer the questions.
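Faceting by demographic group type, as mentioned above, could be sketched as follows. This is only an illustration: the data frame `toy_long`, its `Group_type` column, and all of its values are made-up placeholders, not results from the NSDUH tables.

```r
library(ggplot2)

# Made-up placeholder data (not NSDUH results); Group_type is a hypothetical column
toy_long <- data.frame(
  Year        = rep(2016:2018, times = 4),
  Percent     = c(10, 11, 12, 20, 21, 22, 5, 6, 7, 8, 9, 10),
  Demographic = rep(c("Male", "Female", "Age_12_13", "Age_14_15"), each = 3),
  Group_type  = rep(c("GENDER", "AGE GROUP"), each = 6)
)

p <- ggplot(toy_long,
            aes(x = Year, y = Percent,
                group = Demographic, color = Demographic)) +
  geom_line() +
  facet_wrap(~ Group_type) +                 # one panel per demographic group type
  scale_x_continuous(breaks = 2016:2018) +
  labs(title = "Placeholder example of faceting by demographic group type")

p
```

Each panel then shows only the lines belonging to one group type, which avoids overplotting all subgroups in a single panel.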
Summary
Suggested Homework
---
title: "Open Case Studies : Mental Health of American Youth"
css: style.css
output:
  html_document:
    self_contained: yes
    code_download: yes
    highlight: tango
    number_sections: no
    theme: cosmo
    toc: yes
    toc_float: yes
  pdf_document:
    toc: yes
  word_document:
    toc: yes

---
<style>
#TOC {
  background: url("https://opencasestudies.github.io/img/logo.jpg");
  background-size: contain;
  padding-top: 240px !important;
  background-repeat: no-repeat;
}
</style>



```{r setup, include=FALSE}
knitr::opts_chunk$set(include = TRUE, comment = NA, echo = TRUE,
                      message = FALSE, warning = FALSE, cache = FALSE,
                      fig.align = "center", out.width = '90%')
library(here)
library(knitr)
library(magick)
```

## {.disclaimer_block}

**Disclaimer**: The purpose of the [Open Case Studies](https://opencasestudies.github.io){target="_blank"} project is **to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data**. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given data set, and should not be used in the context of making policy decisions without external consultation from scientific experts. 

## {.license_block}

This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 [(CC BY-NC 3.0)](https://creativecommons.org/licenses/by-nc/3.0/us/){target="_blank"}  United States License.

## {.reference_block}

To cite this case study please use:

Wright, Carrie, and Ontiveros, Michael and Jager, Leah and Taub, Margaret and Hicks, Stephanie. (2020). https://github.com/opencasestudies/ocs-bp-co2-emissions. Mental Health of American Youth (Version v1.0.0).


## **Motivation**
*** 

The following papers motivated this case study. 

#### {.reference_block}

Twenge JM, Cooper AB, Joiner TE, Duffy ME, Binau SG. Age, period, and cohort trends in mood disorder indicators and suicide-related outcomes in a nationally representative dataset, 2005-2017. *J Abnorm Psychol*.128,3 (2019):185-199. doi:10.1037/abn0000410


Olfson, M., Blanco, C., Wang, S., Laje, G. & Correll, C. U. National Trends in the Mental Health Care of Children, Adolescents, and Adults by Office-Based Physicians. *JAMA Psychiatry*. 71, 81 (2014):81-90. doi: 10.1001/jamapsychiatry.2013.3074.

####

The main findings of the first [article](https://content.apa.org/record/2019-12578-001){target="_blank"} are:

>Rates of major depressive episode in the last year increased 52% 2005–2017 (from 8.7% to 13.2%) among adolescents aged 12 to 17 and 63% 2009–2017 (from 8.1% to 13.2%) among young adults 18–25. 

>Serious psychological distress in the last month and suicide-related outcomes (suicidal ideation, plans, attempts, and deaths by suicide) in the last year also increased among young adults 18–25 from 2008–2017 (with a 71% increase in serious psychological distress), with less consistent and weaker increases among adults ages 26 and over. 

>Cultural trends contributing to an increase in mood disorders and suicidal thoughts and behaviors since the mid-2000s, including the rise of electronic communication and digital media and declines in sleep duration, may have had a larger impact on younger people, creating a cohort effect.

While the main findings of the second [article](https://pubmed.ncbi.nlm.nih.gov/24285382/){target="_blank"} are:

>Compared with adult mental health care, the mental health
care of young people has increased more rapidly.

>Between 1995-1998 and 2007-2010, visits resulting in mental disorder diagnoses
per 100 population increased significantly faster for youths (from 7.78 to 15.30 visits) than for
adults (from 23.23 to 28.48 visits) (interaction: P < .001). 

>Psychiatrist visits also increased
significantly faster for youths (from 2.86 to 5.71 visits).


While depression appears to be on the rise for youths, youths also appear to be seeking more mental health care.

In this case study we will examine data related to depression episodes and mental health care to evaluate trends over time. We will be using data from the [National Survey on Drug Use and Health (NSDUH)](https://nsduhweb.rti.org/respweb/homepage.cfm). This data was also used in the first study.  


## **Main Questions**
*** 

#### {.main_question_block}
<b><u> Our main questions: </u></b>

1) How have depression rates in American youth changed since 2002, according to the NSDUH data?  
2) Do mental health services appear to be reaching more youths? How have rates differed between different youth subgroups (gender, ethnicity)?

####

## **Learning Objectives** 
*** 

<style>
div.red { background-color:#FFE6E6; border-radius: 5px; padding: 20px;}
</style>
<div class = "red">

**It may be a good idea to provide a link to Rstudio's webpage. For the first few months using R, I did not differentiate between R and R Studio. It may be a good distinction to make at least implicitly by providing a link.**

</div>

In this case study, we will determine the percentage of youth in America who have had a major depressive episode in the past year, for each year since 2002. We will compare how rates for different youth subgroups have changed over time (by age group (12-13, 14-15, and 16-17), gender, and ethnicity).
We will especially focus on using packages and functions from the [`Tidyverse`](https://www.tidyverse.org/){target="_blank"}, such as [`rvest`](https://github.com/tidyverse/rvest). The tidyverse is a collection of packages created by RStudio. While some students may be familiar with previous R programming packages, these packages make data science in R especially efficient.

```{r, out.width = "20%", echo = FALSE, fig.align ="center"}
include_graphics("https://tidyverse.tidyverse.org/logo.png")
```

*** 

We will begin by loading the packages that we will need:

```{r}
library(here)
library(tidyverse)
library(rvest)
```

<style>
div.red { background-color:#FFE6E6; border-radius: 5px; padding: 20px;}
</style>
<div class = "red">

**I made some modifications to the table below. The `tidyverse` package hyperlink referenced `readr`. I thought this was incorrect. I changed this to the tidyverse website and provided a different description. If this was indeed a typo, it may need to be fixed in other case studies.**

</div>

 Package   | Use                                                                         
---------- |-------------
[here](https://github.com/jennybc/here_here){target="_blank"}       | to easily load and save data
[tidyverse](https://www.tidyverse.org/){target="_blank"}      | R packages for data science
[rvest](https://github.com/tidyverse/rvest){target="_blank"}      | to scrape web pages

The first time we use a function, we will use the `::` to indicate which package we are using. Unless we have overlapping function names, this is not necessary, but we will include it here to be informative about where the functions we will use come from.
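As a small illustration of why `::` can matter, the sketch below uses two functions that share the name `filter()`. The built-in `mtcars` data set is used only as a convenient example and is not part of this case study.

```r
library(dplyr)

# dplyr::filter() keeps the rows of a data frame that match a condition
four_cyl <- dplyr::filter(mtcars, cyl == 4)

# stats::filter() applies a linear filter (here a 3-point moving average);
# once dplyr is loaded, a bare filter() call refers to dplyr's version
smoothed <- stats::filter(1:5, rep(1 / 3, 3))

nrow(four_cyl)
smoothed
```

Writing the package name explicitly removes any ambiguity about which `filter()` is meant.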

## **Context**
*** 

According to other sources, the rate of suicide has increased for most age groups in the United States over the past decade and a half.

```{r, out.width = "80%", echo = FALSE, fig.align ="center"}
include_graphics("https://www.cdc.gov/nchs/images/databriefs/301-350/db309_fig1.png")
```

#### [[source](https://www.cdc.gov/nchs/products/databriefs/db309.htm)]{target="_blank"}


While suicide does appear to be increasing among youths, it also appears to be increasing among middle-aged adults, for both females and males. 

```{r, out.width = "80%", echo = FALSE, fig.align ="center"}
include_graphics("https://www.cdc.gov/nchs/images/databriefs/301-350/db309_fig2.png")
```

#### [[source](https://www.cdc.gov/nchs/products/databriefs/db309.htm)]{target="_blank"}




```{r, out.width = "80%", echo = FALSE, fig.align ="center"}
include_graphics("https://www.cdc.gov/nchs/images/databriefs/301-350/db309_fig3.png")
```

#### [[source](https://www.cdc.gov/nchs/products/databriefs/db309.htm)]{target="_blank"}


According to the [CDC](https://www.cdc.gov/nchs/products/databriefs/db309.htm){target="_blank"}:

> Since 2008, suicide has ranked as the 10th leading cause of death for all ages in the United States. In 2016, suicide became the **second leading cause of death** among those aged **10–34** and the fourth leading cause among those aged 35–54.


**So although suicide is on the rise for most age groups, suicide is one of the top two contributors to death for youths.** This warrants further examination of the mental health of American youths.

```{r, echo = FALSE, out.width="800px"}
knitr::include_graphics(here::here("img","mortality.png"))
```

#### [[source]](https://www.cdc.gov/nchs/data/databriefs/db293.pdf){target="_blank"}


according to this article:https://www.usatoday.com/story/news/nation/2020/01/30/u-s-suicide-rate-rose-again-2018-how-can-suicide-prevention-save-lives/4616479002/
>Death rates in 2018 increased for only two of the 10 leading causes of death: suicide and influenza/pneumonia. IDk if this is true...

*If you are having thoughts of suicide, please know that you are not alone. If you are in danger of acting on suicidal thoughts, call 911. For support and resources, call the National Suicide Prevention Lifeline at 1-800-273-8255 or text 741-741 for the Crisis Text Line.*

I took this from an article.https://www.theatlantic.com/health/archive/2020/06/why-suicide-rates-among-millennials-are-rising/612943/

*If you or someone you know may be struggling with suicidal thoughts, you can call the U.S. National Suicide Prevention Lifeline at 800-273-TALK (8255) any time day or night, or chat online.*
I took this from this article. https://www.usatoday.com/story/news/nation/2020/01/30/u-s-suicide-rate-rose-again-2018-how-can-suicide-prevention-save-lives/4616479002/




covid:https://wellbeingtrust.org/areas-of-focus/policy-and-advocacy/reports/projected-deaths-of-despair-during-covid-19/

Historically, suicide rates were much higher before 1950; however, we have seen an increase over the last 20 years.

```{r, echo = FALSE, out.width="800px"}
knitr::include_graphics(here::here("img","suicide.png"))
```

#### [[source]](https://time.com/5609124/us-suicide-rate-increase/){target="_blank"}





Besides the US, [other countries](https://academic.oup.com/ije/article/48/5/1650/5366210){target="_blank"} are also experiencing increased rates of depression in youths. See [this report](https://apps.who.int/iris/bitstream/handle/10665/254610/WHO-MSD-MER-2017.2-eng.pdf;jsessionid=E44360055DD83EAC472AA40C2853DBFA?sequence=1){target="_blank"} from the World Health Organization about rates of depression in other countries.

Great paper about what may be causing increased depression, and the caveats about whether depression has actually increased: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3330161/



help:https://www.mhanational.org/depression-teens-0

https://www.nimh.nih.gov/health/publications/teen-depression/index.shtml

## **Limitations**
*** 

<style>
div.red { background-color:#FFE6E6; border-radius: 5px; padding: 20px;}
</style>
<div class = "red">

**Perhaps "underestimates in the p-values..." is not the correct way to phrase this. I would look for a better way to word this.**

</div>

<style>
div.red { background-color:#FFE6E6; border-radius: 5px; padding: 20px;}
</style>
<div class = "red">

**Wording for this section should be reviewed.**

</div>

There are some important considerations regarding this data analysis to keep in mind: 

1) We treat sample estimates (estimates of the true population value) as observed values. This ignores the uncertainty in those estimates, so the p-values of the statistical tests conducted are likely underestimated (too small).

2) Furthermore, the sampling mechanism utilized can introduce [selection bias](https://en.wikipedia.org/wiki/Selection_bias?oldformat=true){target="_blank"} in cases where the [sampling methods do not produce a representative sample](https://en.wikipedia.org/wiki/Sampling_(statistics)?oldformat=true){target="_blank"}. 

3) Data is collected from human participants; this presents the *potential* for information bias, as participants in the [sampling frame](https://en.wikipedia.org/wiki/Sampling_frame?oldformat=true){target="_blank"} may, for a variety of reasons, report inaccurate information. 

## **What are the data?**
*** 

The data comes from the [National Survey on Drug Use and Health (NSDUH)](https://nsduhweb.rti.org/respweb/homepage.cfm){target="_blank"} which is directed by the [Substance Abuse and Mental Health Services Administration (SAMHSA)](https://www.samhsa.gov/){target="_blank"}, an agency in the [U.S. Department of Health and Human Services (DHHS)](https://www.hhs.gov/){target="_blank"}. 

This survey started in 1971 and is conducted annually in all 50 states and the District of Columbia.

This information is used for disease surveillance and to guide public policy. 

This data is made available publicly online on the [Substance Abuse & Mental Health Data Archive](https://datafiles.samhsa.gov/){target="_blank"}. 

```{r, out.width = "100%", echo = FALSE, fig.align ="center"}
include_graphics(here("nsudh_screenshot_webpage.png"))
```

## **Data Import**
*** 

Data is often made available online. Usually, the data we are interested in is made available for download on the page as a delimited text file. However, sometimes data is not made available in this manner.

How do we proceed in this scenario?

We could manually copy each cell of data; however, this process is often inefficient, subject to error, and not reproducible. 

We can also use `R` for web scraping. 

[Web scraping](https://en.wikipedia.org/wiki/Web_scraping?oldformat=true){target="_blank"} is the process of extracting data from a website.

There are two main steps to web scraping:  

1. Identify location of data that will be scraped  

2. Save the webpage element to an object  

We accomplish STEP 1 with our web browser.

We accomplish STEP 2 in the `R` programming environment. 

<style>
div.red { background-color:#FFE6E6; border-radius: 5px; padding: 20px;}
</style>
<div class = "red">

**I could not find the animation that I referred to on several occasions.**

**However, I was able to find the sources that I consulted to create the three step `rvest` process. They are included below**

[RStudio](https://rstudio-pubs-static.s3.amazonaws.com/266430_f3fd4660b2744751ab144aa130768a06.html){target="_blank"}

[Blog](http://blog.corynissen.com/2015/01/using-rvest-to-scrape-html-table.html){target="_blank"}

</div>

The `rvest` package can be thought of as the `pdftools` package for web scraping. After pulling the data, additional wrangling will likely be required; but like the `pdftools` package, `rvest` streamlines the extraction process.  

The two steps can be broken down even further: 

1) Identify location of data that will be scraped

+ right-click to inspect element (webpage)
+ hover pointer over components of element (webpage) until the data has been found
+ copy XPath of data sought

2) Save webpage element to an object

+ import html code for element (webpage)
+ extract pieces (table) out of HTML documents (webpage) using XPath
+ parse the html table into a data frame
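
The second step can be sketched without a network connection. The small web page below is made up and held in a string (`toy_page`, `toy_table`, and the cell values are our own illustrative names and data, not part of the NSDUH page); the three `rvest` calls mirror the three bullets above.

```{r}
library(rvest)

# A small made-up web page held in a string, so the sketch runs offline
toy_page <- "<html><body><div>
  <table>
    <tr><th>Setting</th><th>2017</th><th>2018</th></tr>
    <tr><td>Setting A</td><td>1,234a</td><td>1,300</td></tr>
    <tr><td>Setting B</td><td>*</td><td>210</td></tr>
  </table>
</div></body></html>"

toy_table <- toy_page %>%
  read_html() %>%                                 # import the html code
  html_nodes(xpath = "/html/body/div/table") %>%  # extract the table using its XPath
  html_table()                                    # parse the html table into a data frame

# html_table() returns a list with one element per matched table
toy_table <- toy_table[[1]]
toy_table
```

Note that the parsed values are still text that contains commas and other markers, which is why additional wrangling is usually needed after scraping.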

Below is an animated overview of the process.

```{r, eval=FALSE, echo=FALSE}
step1 <- image_read(here("webpage_screenshot.png"))
step2 <- image_read(here("table_screenshot_inspect.png"))
step3 <- image_read(here("table_screenshot_inspect_table.png"))
step4 <- image_read(here("table_screenshot_inspect_table_xpath.png"))
step5 <- image_read(here("table_screenshot_xpath_copy_r.png"))
step5_zoom <- image_read(here("table_screenshot_xpath_copy_r_zoom.png"))

image_info(step5_zoom)

step5_zoom <- image_border(step5_zoom, "white", "284x334")

img <- c(step1,
         step2,
         step2,
         step3,
         step3,
         step4,
         step4,
         step5,
         step5,
         step5_zoom,
         step5_zoom,
         step5_zoom,
         step1)

educational_gif <- image_resize(img, '1440x900!') %>%
  image_background('white') %>%
  image_morph(frames = 10) %>%
  image_animate(delay = 20,
                optimize = TRUE)

image_write(educational_gif, "educational.gif")
```

```{r, echo=FALSE,eval=FALSE}
image_read(here("educational.gif"))
```

```{r, echo=FALSE}
step1 <- image_read(here("webpage_screenshot.png"))
step2 <- image_read(here("table_screenshot_inspect.png"))
step3 <- image_read(here("table_screenshot_inspect_table.png"))
step4 <- image_read(here("table_screenshot_inspect_table_xpath.png"))
step5 <- image_read(here("table_screenshot_xpath_copy_r.png"))
step5_zoom <- image_read(here("table_screenshot_xpath_copy_r_zoom.png"))
```

[Let's go to the web page with all the tables we are interested in scraping.](https://www.samhsa.gov/data/sites/default/files/cbhsq-reports/NSDUHDetailedTabs2018R2/NSDUHDetTabsSect11pe2018.htm)

```{r, echo=FALSE}
step1
```

Once on the webpage, there aren't any visible options to download the data. 

Right-click and select "Inspect".

```{r, echo=FALSE}
step2
```

A window opens. 

This window allows us to glance at the internal mechanics of the webpage. To scrape the data from the webpage, we first need to learn a little bit about the components that make up the web page. 

Hovering our mouse over the elements of the webpage highlights the section of the webpage that each element represents. By hovering over several elements (and expanding an element when the highlighted portion is too large), we can identify the element that contains the data we are looking for. 

```{r, echo=FALSE}
step3
```

Right-click on the element and copy the XPath. We will need this XPath for the next step.

```{r, echo=FALSE}
step4
```

Now we can return to the `R` programming environment.

```{r, echo=FALSE}
step5
```

<style>
div.red { background-color:#FFE6E6; border-radius: 5px; padding: 20px;}
</style>
<div class = "red">

**I included the following line to help separate the process.**

</div>

***

**2) Save webpage element to an object** 

For the first question we intend to answer, the XPath is `/html/body/div[4]/div[1]/table`. We use this XPath with functions from the `rvest` package to scrape data from the web.

<style>
div.red { background-color:#FFE6E6; border-radius: 5px; padding: 20px;}
</style>
<div class = "red">

**I wanted to include the last slide/component of the GIF. However, I realized that the audience would also benefit from having an actual code chunk. As a result, this section may need some very minor reworking.**

</div>

```{r, echo=FALSE}
step5_zoom
```

We need to:

+ import html code for element (webpage)
+ extract pieces (table) out of HTML documents (webpage) using XPath
+ parse the html table into a data frame

To do this:

+ We import the html code using `rvest::read_html()`.
+ We extract specific components of the webpage using `rvest::html_nodes()`.
+ We convert this html table into a data frame using `rvest::html_table()`.

<style>
div.red { background-color:#FFE6E6; border-radius: 5px; padding: 20px;}
</style>
<div class = "red">

**The `rvest` package provides wrappers for the `xml2` and `httr` packages. I was not sure whether to tag the following functions as `rvest` or `xml2`/`httr`. I will leave that decision to you.**

</div>

```{r}
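# URL of the page that holds the tables
url11.1a <- "https://www.samhsa.gov/data/sites/default/files/cbhsq-reports/NSDUHDetailedTabs2018R2/NSDUHDetTabsSect11pe2018.htm"
# Import the html, extract the table at the XPath, and parse it
table11.1a <- url11.1a %>%
  read_html() %>%
  html_nodes(xpath='/html/body/div[4]/div[1]/table') %>%
  html_table()
# html_table() returns a list; keep its first element
table11.1a <- table11.1a[[1]]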
```

Great! We have successfully scraped the data.

From here on, we will need to wrangle the data.

First, we need to repeat the above process for the other tables we are interested in. 

We can create a function to accomplish this succinctly. 

<style>
div.red { background-color:#FFE6E6; border-radius: 5px; padding: 20px;}
</style>
<div class = "red">

**For some odd reason, calling the `function()` function with the `base::` prefix causes an error.**

</div>

```{r}
# Scrape the table found at XPATH on the page and return it
scraper <- function(XPATH){
  url <- "https://www.samhsa.gov/data/sites/default/files/cbhsq-reports/NSDUHDetailedTabs2018R2/NSDUHDetTabsSect11pe2018.htm"
  table <- url %>%
    read_html() %>%
    html_nodes(xpath=XPATH) %>%
    html_table()
  # html_table() returns a list; return its first element
  output <- table[[1]]
  output
}
```

We apply the function we created to the XPaths of the remaining tables.

```{r}
table11.1b <- scraper(XPATH = "/html/body/div[4]/div[2]/table")
table11.2a <- scraper(XPATH = '/html/body/div[4]/div[3]/table')
table11.2b <- scraper(XPATH = '/html/body/div[4]/div[4]/table')
table11.3a <- scraper(XPATH = '/html/body/div[4]/div[5]/table')
table11.3b <- scraper(XPATH = '/html/body/div[4]/div[6]/table')
table11.4a <- scraper(XPATH = '/html/body/div[4]/div[7]/table')
table11.4b <- scraper(XPATH = '/html/body/div[4]/div[8]/table')
```
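
Because the XPaths above differ only in the index of the inner `div`, they could also be generated rather than typed out. The following is a minimal sketch of that idea; `table_names` and `table_xpaths` are names we introduce for illustration.

```{r}
# Names of the eight tables, in the order they appear on the page
table_names <- c("table11.1a", "table11.1b", "table11.2a", "table11.2b",
                 "table11.3a", "table11.3b", "table11.4a", "table11.4b")

# Build one XPath per table by filling in the index of the inner div
table_xpaths <- sprintf("/html/body/div[4]/div[%d]/table",
                        seq_along(table_names))
names(table_xpaths) <- table_names

table_xpaths

# Applying the scraper() function defined above to every XPath would then
# return a named list of tables (left commented out to avoid repeated downloads):
# all_tables <- lapply(table_xpaths, scraper)
```

Typing each call out, as done above, keeps every table in its own object, which is convenient for the step-by-step wrangling that follows.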

## **Data Exploration and Wrangling**
*** 

Now that we've imported the data, let's see if we can wrangle a table. Since the data comes from a source that is well-maintained, it is likely that whatever steps we take to wrangle this first table will also be necessary in the wrangling of subsequent tables. This is because well-maintained data sources often format different datasets similarly. We can take advantage of this similarity to speed up the wrangling process. 

**Table11.1a**

```{r}
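# Inspect the dimensions (rows, columns) of the scraped table
base::dim(table11.1a)

# Remove the last row of the table
table11.1a <- table11.1a[-dim(table11.1a)[1],]

# Recode placeholder strings as missing values, one column at a time
table11.1a <- table11.1a %>%
  dplyr::mutate(dplyr::across(.cols = dplyr::everything(),
                              .fns = ~ base::replace(.x,
                                                     .x %in% c("nc", "--", "", "*"),
                                                     NA)))

# Give the label column a shorter name
table11.1a <- table11.1a %>%
  tibble::as_tibble() %>%
  dplyr::rename(MHS_setting = `Setting Where Mental Health ServiceWas Received`)

# Split the label column (partA) from the year columns (partB)
partA <- table11.1a %>%
  dplyr::select(MHS_setting)

partB <- table11.1a %>%
  dplyr::select(-MHS_setting)

# Strip digits, line breaks, punctuation, and runs of repeated blanks from the labels
partA <- partA %>%
  dplyr::mutate(MHS_setting = base::gsub("[[:digit:]]+|[\r\n]|[[:punct:]]|([[:blank:]])\\1+",
                            "",
                            MHS_setting))

# Strip the "a" markers and the "," thousands separators from the values
partB <- partB %>%
  dplyr::mutate(dplyr::across(.cols = dplyr::everything(),
                              .fns = ~ stringr::str_remove_all(.x, "a"))) %>%
  dplyr::mutate(dplyr::across(.cols = dplyr::everything(),
                              .fns = ~ stringr::str_remove_all(.x, ",")))

base::rm(table11.1a)

# Recombine the two parts
table11.1a <- dplyr::bind_cols(partA,
                               partB)

# Pivot the year columns to long format
table11.1a <- table11.1a %>%
  tidyr::pivot_longer(cols = dplyr::contains("20"), names_to = "Year", values_to = "Number")

# Drop the leading heading rows, which hold no data
table11.1a <- table11.1a %>%
  dplyr::filter(MHS_setting != "General Medicine") %>%
  dplyr::filter(MHS_setting != "Juvenile Justice")

# Convert Year and Number from character to numeric
table11.1a <- table11.1a %>%
  dplyr::mutate(dplyr::across(c(Year, Number), as.numeric))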
```
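
To see what the individual cleaning steps above do, here is a minimal sketch on made-up values. The labels and numbers are invented for illustration and are not taken from the scraped table.

```{r}
library(stringr)

# Made-up row labels with footnote-style digits and punctuation (not real data)
toy_labels <- c("Education1", "Child Welfare2,3")

# Same pattern as above: drop digits, line breaks, punctuation, and runs of repeated blanks
clean_labels <- gsub("[[:digit:]]+|[\r\n]|[[:punct:]]|([[:blank:]])\\1+",
                     "",
                     toy_labels)
clean_labels

# Made-up cell values with a trailing "a" marker and a thousands separator
toy_values <- c("2,943a", "3,010", NA)

# Remove the "a" marker, then the comma, then convert to numeric
no_marker <- str_remove_all(toy_values, "a")
no_comma <- str_remove_all(no_marker, ",")
clean_values <- as.numeric(no_comma)
clean_values
```

The conversion to numeric only succeeds once the non-numeric characters have been removed, which is why it is the last step.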

We will write a function to simplify this process.

The function needs to:

- remove the last row of the table
- get rid of certain patterns
- transition the data to long format

```{r}
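data_prep_settings <- function(TABLE, old_col, new_col, pivot_col){
  # Remove the last row of the table
  TABLE <- TABLE[-dim(TABLE)[1],]
  # Recode placeholder strings as missing values, one column at a time
  TABLE <- TABLE %>%
    mutate(across(.cols = everything(),
                  .fns = ~ replace(.x, .x %in% c("nc", "--", "", "*"), NA)))
  # Rename the label column
  TABLE <- TABLE %>%
    as_tibble() %>%
    rename({{new_col}} := {{old_col}})
  # Split the label column (partA) from the year columns (partB)
  partA <- TABLE %>%
    select({{new_col}})
  partB <- TABLE %>%
    select(-{{new_col}})
  # Strip digits, line breaks, punctuation, and runs of repeated blanks from the labels
  partA <- partA %>%
    mutate({{new_col}} := partA %>%
             select({{new_col}}) %>%
             dplyr::pull({{new_col}}) %>%
             gsub("[[:digit:]]+|[\r\n]|[[:punct:]]|([[:blank:]])\\1+",
                  "", .))
  # Strip the "a" markers and the "," thousands separators from the values
  partB <- partB %>%
    mutate(across(.cols = everything(),
                  .fns = ~ str_remove_all(.x, "a"))) %>%
    mutate(across(.cols = everything(),
                  .fns = ~ str_remove_all(.x, ",")))
  rm(TABLE)
  # Recombine the two parts and pivot the year columns to long format
  TABLE <- bind_cols(partA,
                     partB)
  TABLE <- TABLE %>%
    pivot_longer(cols = contains("20"), names_to = "Year", values_to = pivot_col)
  TABLE
}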
```

<style>
div.red { background-color:#FFE6E6; border-radius: 5px; padding: 20px;}
</style>
<div class = "red">

**I included the following line to help separate the tables.**

</div>

***

**Table11.1b**

We then apply this function to the table, remove the heading rows, and convert the `Year` and `Percent` columns to numeric class.

```{r}
dim(table11.1b)

table11.1b <- data_prep_settings(TABLE = table11.1b,
          old_col = "Setting Where Mental Health ServiceWas Received",
          new_col = "MHS_setting",
          pivot_col = "Percent")

table11.1b <- table11.1b %>%
  filter(MHS_setting != "General Medicine") %>%
  filter(MHS_setting != "Juvenile Justice") #Leading lines with no data

table11.1b <- table11.1b %>%
  mutate(across(c(Year, Percent), as.numeric))
```

We write a function to simplify this process for data that uses demographic groups as units of observation.

The function needs to:

- remove the last row of the table
- get rid of certain patterns
- transition the data to long format

```{r}
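data_prep_dem <- function(TABLE, old_col, new_col, pivot_col){
  # Remove the last row of the table
  TABLE <- TABLE[-dim(TABLE)[1],]
  # Recode placeholder strings as missing values, one column at a time
  TABLE <- TABLE %>%
    mutate(across(.cols = everything(),
                  .fns = ~ replace(.x, .x %in% c("nc", "--", "", "*"), NA)))
  # Rename the label column
  TABLE <- TABLE %>%
    as_tibble() %>%
    rename({{new_col}} := {{old_col}})
  # Split the label column (partA) from the year columns (partB)
  partA <- TABLE %>%
    dplyr::select({{new_col}})
  partB <- TABLE %>%
    dplyr::select(-{{new_col}})
  # Strip line breaks, punctuation, and runs of repeated blanks from the labels
  # (digits are kept because the age group labels need them)
  partA <- partA %>%
    mutate({{new_col}} := partA %>%
             dplyr::select({{new_col}}) %>%
             pull({{new_col}}) %>%
             gsub("[\r\n]|[[:punct:]]|([[:blank:]])\\1+",
                  "", .))
  # Relabel age groups: a label containing "1" becomes "Age_<chars 1-2>_<chars 3-4>"
  partA <- partA %>%
    mutate({{new_col}} := dplyr::case_when(
      stringr::str_detect(!!base::as.name(new_col), pattern = "1") ~
        base::paste("Age",
                    stringr::str_sub(!!base::as.name(new_col), start = 1, end = 2),
                    stringr::str_sub(!!base::as.name(new_col), start = 3, end = 4),
                    sep = "_"),
      TRUE ~ !!base::as.name(new_col)))
  # Strip the "a" markers and the "," thousands separators from the values
  partB <- partB %>%
    mutate(across(.cols = everything(),
                  .fns = ~ str_remove_all(.x, "a"))) %>%
    mutate(across(.cols = everything(),
                  .fns = ~ str_remove_all(.x, ",")))
  rm(TABLE)
  # Recombine the two parts and pivot the year columns to long format
  TABLE <- bind_cols(partA,
                     partB)
  TABLE <- TABLE %>%
    pivot_longer(cols = contains("20"), names_to = "Year", values_to = pivot_col)
  TABLE
}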
```
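
The age relabelling inside `data_prep_dem()` can be hard to read because of the tidy evaluation syntax around it. Here is a minimal sketch of the same `case_when()` rule applied to a plain character vector; the labels in `toy_demographic` are made up, on the assumption that an age range such as "12-13" reads "1213" once punctuation has been removed.

```{r}
library(dplyr)
library(stringr)

# Made-up labels as they might look after punctuation is removed
toy_demographic <- c("1213", "1415", "Male")

# Labels containing a "1" are treated as age groups and rebuilt as "Age_<from>_<to>";
# every other label is left unchanged
relabelled <- case_when(
  str_detect(toy_demographic, pattern = "1") ~
    paste("Age",
          str_sub(toy_demographic, start = 1, end = 2),
          str_sub(toy_demographic, start = 3, end = 4),
          sep = "_"),
  TRUE ~ toy_demographic
)
relabelled
```

This rule relies on age groups being the only labels that contain the digit 1, so it would need adjusting for tables with other numeric labels.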

<style>
div.red { background-color:#FFE6E6; border-radius: 5px; padding: 20px;}
</style>
<div class = "red">

**I included the following line to help separate the tables.**

</div>

***

**Table11.2a**

We use this function to wrangle the next pair of tables. 

```{r}
dim(table11.2a)

table11.2a <- data_prep_dem(TABLE = table11.2a,
          old_col = "Demographic Characteristic",
          new_col = "Demographic",
        pivot_col = "Number")

table11.2a %>%
  filter(!complete.cases(.)) %>%
  dplyr::group_by(Demographic) %>%
  tally()

table11.2a <- table11.2a %>%
  filter(stats::complete.cases(.) | Demographic == "AIAN")

table11.2a <- table11.2a %>%
  mutate(across(c(Year, Number), as.numeric))
```
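
The `complete.cases()` filter above does two things at once, which a small made-up example may help show. The rows in `toy_long` are invented, and we assume (it is not stated in the source) that heading rows carry no values while `AIAN` is a real group whose missing values should be kept.

```{r}
library(dplyr)

# Made-up long-format rows: a heading row, a data row, and an AIAN row with a missing value
toy_long <- tibble(Demographic = c("GENDER", "Male", "AIAN"),
                   Year = c("2009", "2009", "2009"),
                   Number = c(NA, "1500", NA))

# Count incomplete rows per label to see which labels would be dropped
toy_long %>%
  filter(!complete.cases(.)) %>%
  group_by(Demographic) %>%
  tally()

# Keep complete rows, plus AIAN rows even when their value is missing
toy_kept <- toy_long %>%
  filter(complete.cases(.) | Demographic == "AIAN")
toy_kept
```

Tallying the incomplete rows first is a useful check: it shows which labels are about to be removed before the filter is applied.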

<style>
div.red { background-color:#FFE6E6; border-radius: 5px; padding: 20px;}
</style>
<div class = "red">

**I included the following line to help separate the tables.**

</div>

***

**Table11.2b**

```{r}
dim(table11.2b)

table11.2b <- data_prep_dem(TABLE = table11.2b,
          old_col = "Demographic Characteristic",
          new_col = "Demographic",
          pivot_col = "Percent")

table11.2b %>%
  filter(!complete.cases(.)) %>%
  group_by(Demographic) %>%
  tally()

table11.2b <- table11.2b %>%
  filter(complete.cases(.) | Demographic == "AIAN")

table11.2b <- table11.2b %>%
  mutate(across(c(Year, Percent), as.numeric))
```

We repeat this process for the remaining tables.

<style>
div.red { background-color:#FFE6E6; border-radius: 5px; padding: 20px;}
</style>
<div class = "red">

**I included the following line to help separate the tables.**

</div>

***

**Table 11.3a**

```{r}
dim(table11.3a)

table11.3a <- data_prep_dem(TABLE = table11.3a,
          old_col = "Demographic Characteristic",
          new_col = "Demographic",
          pivot_col = "Number")

table11.3a %>%
  filter(!complete.cases(.)) %>%
  group_by(Demographic) %>%
  tally()

table11.3a <- table11.3a %>%
  filter(complete.cases(.) | Demographic == "AIAN")

table11.3a <- table11.3a %>%
  mutate(across(c(Year, Number), as.numeric))
```

<style>
div.red { background-color:#FFE6E6; border-radius: 5px; padding: 20px;}
</style>
<div class = "red">

**I included the following line to help separate the tables.**

</div>

***

**Table 11.3b**

```{r}
dim(table11.3b)

table11.3b <- data_prep_dem(TABLE = table11.3b,
          old_col = "Demographic Characteristic",
          new_col = "Demographic",
          pivot_col = "Percent")

table11.3b %>%
  filter(!complete.cases(.)) %>%
  group_by(Demographic) %>%
  tally()

table11.3b <- table11.3b %>%
  filter(complete.cases(.) | Demographic == "AIAN")

table11.3b <- table11.3b %>%
  mutate(across(c(Year, Percent), as.numeric))
```

<style>
div.red { background-color:#FFE6E6; border-radius: 5px; padding: 20px;}
</style>
<div class = "red">

**I included the following line to help separate the tables.**

</div>

***

**Table 11.4a**

```{r}
dim(table11.4a)

table11.4a <- data_prep_dem(TABLE = table11.4a,
          old_col = "Demographic Characteristic",
          new_col = "Demographic",
          pivot_col = "Number")

table11.4a %>%
  filter(!complete.cases(.)) %>%
  group_by(Demographic) %>%
  tally()

table11.4a <- table11.4a %>%
  filter(complete.cases(.) | Demographic == "AIAN")

table11.4a <- table11.4a %>%
  mutate(across(c(Year, Number), as.numeric))
```

<style>
div.red { background-color:#FFE6E6; border-radius: 5px; padding: 20px;}
</style>
<div class = "red">

**I included the following line to help separate the tables.**

</div>

***

**Table 11.4b**

```{r}
dim(table11.4b)

table11.4b <- data_prep_dem(TABLE = table11.4b,
          old_col = "Demographic Characteristic",
          new_col = "Demographic",
          pivot_col = "Percent")

table11.4b %>%
  filter(!complete.cases(.)) %>%
  group_by(Demographic) %>%
  tally()

table11.4b <- table11.4b %>%
  filter(complete.cases(.) | Demographic == "AIAN")

table11.4b <- table11.4b %>%
  mutate(across(c(Year, Percent), as.numeric))
```

Now that we've wrangled the data, we can go ahead and proceed with our analysis. 

## **Data Analysis**
*** 

<style>
div.red { background-color:#FFE6E6; border-radius: 5px; padding: 20px;}
</style>
<div class = "red">

**In this section, we only analyzed data from tables 2-4. Data from table 1 is very different from data from tables 2-4. For expediency, I did not include an example with data from table 1. The following code, however, can easily be repurposed to accomplish that once a specific group has been identified to conduct the test on.**

</div>

We would like to conduct a [chi-squared test](https://en.wikipedia.org/wiki/Chi-squared_test?oldformat=true) for independence. 

To conduct this statistical test, we need to produce a 2x2 table.

The following code subsets the data we need and makes the necessary manipulations so that the units of observation are appropriate. 

```{r}
chi_square_11.2a <- table11.2a %>%
  filter(Year %in% c(2009, 2018)) %>%
  filter(Demographic %in% c("Male","Female")) %>%
  mutate(Number = Number * 1000)
```

The resulting object is still in long format.

```{r}
chi_square_11.2a
```

To conduct a chi-squared test for independence, we will need a [contingency table](https://en.wikipedia.org/wiki/Contingency_table?oldformat=true). 

A contingency table can be produced from data in long format by transforming the data to wide format and repurposing some values as row names. 

```{r}
chi_square_11.2a <- chi_square_11.2a %>%
  tidyr::pivot_wider(names_from = Year,
              names_prefix = "Year", 
              values_from = Number) %>%
  tibble::column_to_rownames("Demographic")
```

The final object should look like this. 

```{r}
chi_square_11.2a
```

The chi-squared test for independence can be conducted using the `stats::chisq.test()` function. 

```{r}
stats::chisq.test(chi_square_11.2a)
```
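
To show what `chisq.test()` computes, here is a minimal worked sketch on a made-up 2x2 table (the counts are hypothetical and the object names are ours). Under independence, the expected count in each cell is the row total times the column total divided by the grand total, and the statistic sums the squared differences between observed and expected counts, each divided by the expected count.

```{r}
# Hypothetical 2x2 table of counts (not real data)
toy_2x2 <- matrix(c(30, 70,
                    45, 55),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("Female", "Male"),
                                  c("Year2009", "Year2018")))

# Expected count in each cell under independence: row total * column total / grand total
expected_by_hand <- outer(rowSums(toy_2x2), colSums(toy_2x2)) / sum(toy_2x2)

# Pearson chi-squared statistic: sum of (observed - expected)^2 / expected
statistic_by_hand <- sum((toy_2x2 - expected_by_hand)^2 / expected_by_hand)

# chisq.test() applies a continuity correction to 2x2 tables by default,
# so it is switched off here to match the hand calculation
toy_test <- stats::chisq.test(toy_2x2, correct = FALSE)

c(by_hand = statistic_by_hand, chisq.test = unname(toy_test$statistic))
```

With the default `correct = TRUE`, the reported statistic for a 2x2 table is somewhat smaller than the hand calculation because of the continuity correction.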

We can repeat this process for the remaining tables.

```{r}
chi_square_11.3a <- table11.3a %>%
  filter(Year %in% c(2009, 2018)) %>%
  filter(Demographic %in% c("Male","Female")) %>%
  mutate(Number = Number * 1000)

chi_square_11.3a <- chi_square_11.3a %>%
  pivot_wider(names_from = Year,
              names_prefix = "Year", 
              values_from = Number) %>%
  column_to_rownames("Demographic")

chi_square_11.3a
```

```{r}
chisq.test(chi_square_11.3a)
```

```{r}
chi_square_11.4a <- table11.4a %>%
  filter(Year %in% c(2009, 2018)) %>%
  filter(Demographic %in% c("Male","Female")) %>%
  mutate(Number = Number * 1000)

chi_square_11.4a <- chi_square_11.4a %>%
  pivot_wider(names_from = Year,
              names_prefix = "Year", 
              values_from = Number) %>%
  column_to_rownames("Demographic")

chi_square_11.4a
```

```{r}
chisq.test(chi_square_11.4a)
```

## **Data Visualization**
*** 

<style>
div.red { background-color:#FFE6E6; border-radius: 5px; padding: 20px;}
</style>
<div class = "red">

**This is the intentionally terrible plot that requires faceting.**

</div>

```{r}
table11.1b %>%
  ggplot2::ggplot(aes(x = Year, y = Percent, group = MHS_setting)) +
  ggplot2::geom_line() +
  ggplot2::scale_x_continuous(breaks = seq(2009, 2018, by=1),
                     labels = seq(2009, 2018, by=1),
                     limits = c(2009, 2018)) +
  ggplot2::labs(title = "Settings Where Mental Health Services Were Received in Past Year\namong Persons Aged 12 to 17",
       subtitle = "Percentages, 2002-2018")
```

<style>
div.red { background-color:#FFE6E6; border-radius: 5px; padding: 20px;}
</style>
<div class = "red">

**The plots below need to be correctly faceted. Keep in mind that tables 11.2+ must be faceted by demographic group type and not by setting type.**

</div>

```{r}
table11.1b %>%
  ggplot(aes(x = Year, y = Percent, group = MHS_setting)) +
  geom_line() +
  scale_x_continuous(breaks = seq(2009, 2018, by=1),
                     labels = seq(2009, 2018, by=1),
                     limits = c(2009, 2018)) +
  labs(title = "Settings Where Mental Health Services Were Received in Past Year\namong Persons Aged 12 to 17",
       subtitle = "Percentages, 2002-2018")
```

```{r}
table11.2b %>%
  ggplot(aes(x = Year, y = Percent, group = Demographic)) +
  geom_line() +
  scale_x_continuous(breaks = seq(2009, 2018, by=1),
                     labels = seq(2009, 2018, by=1),
                     limits = c(2009, 2018)) +
  labs(title = "Major Depressive Episode in Past Year\namong Persons Aged 12 to 17",
       subtitle = "By Demographic Characteristics, Percentages, 2004-2018")
```

```{r}
table11.3b %>%
  ggplot(aes(x = Year, y = Percent, group = Demographic)) +
  geom_line() +
  scale_x_continuous(breaks = seq(2009, 2018, by=1),
                     labels = seq(2009, 2018, by=1),
                     limits = c(2009, 2018)) +
  labs(title = "Major Depressive Episode with Severe Impairment in Past Year\namong Persons Aged 12 to 17",
       subtitle = "By Demographic Characteristics: Percentages, 2006-2018")
```

```{r}
table11.4b %>%
  ggplot(aes(x = Year, y = Percent, group = Demographic)) +
  geom_line() +
  scale_x_continuous(breaks = seq(2009, 2018, by=1),
                     labels = seq(2009, 2018, by=1),
                     limits = c(2009, 2018)) + 
  labs(title = "Receipt of Treatment for Depression in Past Year among\nPersons Aged 12 to 17 with Major Depressive Episode in Past Year",
       subtitle = "By Demographic Characteristics: Percentages, 2004-2018")
```
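
The plots above draw every group in a single panel, which makes the lines hard to tell apart. One possible way to split them is `facet_wrap()`. The sketch below uses made-up percentages, and `Group_type` is a hypothetical column (not present in the scraped tables) recording which kind of demographic group each row belongs to.

```{r}
library(ggplot2)

# Made-up percentages for illustration only; Group_type is a hypothetical column
toy_plot_data <- data.frame(
  Year = rep(c(2016, 2017, 2018), times = 4),
  Percent = c(10, 11, 12, 15, 16, 18, 9, 10, 12, 14, 15, 16),
  Demographic = rep(c("Male", "Female", "Age_12_13", "Age_14_15"), each = 3),
  Group_type = rep(c("Gender", "Age"), each = 6)
)

toy_facet_plot <- ggplot(toy_plot_data,
                         aes(x = Year, y = Percent, color = Demographic)) +
  geom_line() +
  scale_x_continuous(breaks = 2016:2018) +
  facet_wrap(~ Group_type) +  # one panel per type of demographic group
  labs(title = "Made-up percentages by demographic group")

toy_facet_plot
```

Applying this to the real tables would first require adding a column like `Group_type` during wrangling.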

<style>
div.red { background-color:#FFE6E6; border-radius: 5px; padding: 20px;}
</style>
<div class = "red">

**The plots created (after faceting properly) can be used to answer the questions listed at the beginning of the case study. After finalizing the plots, some time should be spent on framing the visualizations in a way that underscores how they were used to answer the questions.**

</div>

## **Summary**
*** 

## **Suggested Homework**
*** 

## **Additional Information**
***

### Helpful Links


**This needs to be updated**

[guide](https://briatte.github.io/ggcorr/) for using GGally to create correlation plots

<u>Terms and concepts covered:</u>  

[Tidyverse](https://www.tidyverse.org/){target="_blank"}  
[RStudio cheatsheets](https://rstudio.com/resources/cheatsheets/){target="_blank"}  

<u>Packages used in this case study: </u>

 Package   | Use                                                                         
---------- |-------------
[here](https://github.com/jennybc/here_here){target="_blank"}       | to easily load and save data 
[tidyverse](https://www.tidyverse.org/){target="_blank"}      | R packages for data science
[rvest](https://github.com/tidyverse/rvest){target="_blank"}      | to scrape web pages


### Acknowledgements

We would like to acknowledge [Tamar Mendelson](https://www.jhsph.edu/faculty/directory/profile/1770/tamar-mendelson) for assisting in framing the major direction of the case study.

We would also like to acknowledge the [Bloomberg American Health Initiative](https://americanhealth.jhu.edu/) for funding this work. 


### **RA Notes**

[This is the motivating article for this case study](https://pubmed.ncbi.nlm.nih.gov/30869927/). In this article, they web scrape to obtain the data they need. 

[Here is the Lieber Institute's resource on web scraping](http://research.libd.org/rstatsclub/post/introduction-to-scraping-and-wranging-tables-from-research-articles/#.Xw878ZNKhQJ)

[Here is a resource that the Lieber Institute source above refers to](http://blog.corynissen.com/2015/01/using-rvest-to-scrape-html-table.html)

[Here is a good resource to learn how to web scrape](https://rstudio-pubs-static.s3.amazonaws.com/266430_f3fd4660b2744751ab144aa130768a06.html)

[This is the set of tables we would like to consider](https://www.samhsa.gov/data/sites/default/files/cbhsq-reports/NSDUHDetailedTabs2018R2/NSDUHDetTabsSect11pe2018.htm)
